Abstract

According to the probability ranking principle, the document set 
with the highest values of probability of relevance optimizes information
retrieval effectiveness, provided that the probabilities are estimated as
accurately as possible. The key point of this principle is the sep- 
aration of the document set into two subsets with a given level of 
fallout and with the highest recall. If subsets and set measures are
replaced by subspaces and space measures, we obtain an alternative 
theory stemming from Quantum Theory. That theory is named vector
probability because vectors represent events as sets do in
classical probability. The paper shows that the separation into vec- 
tor subspaces is more effective than the separation into subsets with 
the same available evidence. The result is proved mathematically and 
verified experimentally. In general, the paper suggests that quantum 
theory is not only a source of rhetorical inspiration, but is a sufficient
condition to improve retrieval effectiveness in a principled way. 

1 Introduction 

Information Retrieval (IR) systems decide about the relevance under condi- 
tions of uncertainty. As a measure of uncertainty is necessary, a probability 
theory defines the event space and the probability distribution. The research 
in probabilistic IR is based on the classical theory of probability, which de- 
scribes events and probability distributions using, respectively, sets and set 
measures obeying the usual axioms stated in [6]. Set theory is not the only
way to define probability, though.



If subsets and set measures are replaced by vector subspaces and space- 
based measures, we obtain an alternative theory called, in this paper, vector 
probability. Although this theory stems from Quantum Theory, we prefer to
use "vector" because vectors are sufficient to represent events, just as sets
represent events within classical probability; this is the feature of
interest here. The "quantumness" of IR is outside the scope of this paper,
which instead explains why the replacement of classical with vector
probability is crucial to ranking.

Ranking is an essential task in IR. Indeed, it should not come as a surprise 
that the Probability Ranking Principle (PRP) reported in [10] is by far the 
most important theoretical result to date because it is an incisive factor in 
effectiveness. Although probabilistic IR systems reach good results, ranking
is far from perfect because irrelevant documents are often ranked at
the top of the retrieved document list, or useful units are missing from it.

Besides the definition of weighting schemes and ranking algorithms, new 
results can be achieved if the research in IR views problems from a new the- 
oretical perspective. We propose vector probability to describe the events 
and probabilities underlying an IR system. We show that ranking in accor- 
dance with vector probability is more effective than ranking in accordance 
with classical probability, given that the same evidence is available for prob- 
ability estimation. The effectiveness is measured in terms of probability of 
correct decision or, equivalently, of probability of error. The result is proved 
mathematically and verified experimentally. 

Although the use of the mathematical apparatus of Quantum Theory is 
pivotal in showing the superiority of vector probability (at least in terms of 
retrieval effectiveness), this paper does not necessarily end in an investigation 
or assertion of quantum phenomena in IR. Rather, we argue that vector
probability, and hence Quantum Theory, is sufficient to go beyond the state
of the art of IR, thus supporting the hypothesis stated in [13] according to 
which Quantum Theory may pave the way for a breakthrough in IR research. 

We organize the paper as follows. The paper gives an intuitive view of 
our contribution in Section 2 and sketches how an IR system built on
the premise of Quantum Theory can outperform any other system. Section 3
briefly reviews the classical probability of relevance before introducing the
notion of vector probability in Section 4. Section 5 is one of the central
sections of the paper because it introduces the optimal vectors which are
exploited in Section 6, where we provide our main result, that is, the fact that
a system that ranks documents according to the probability of occurrence of 
(a) An IR system based on classical probability. A document is like an emitter of binary symbols
referring to term occurrence. The probability that 1 occurs depends on a parameter that in turn
depends on relevance. After training the probability distributions, the system B, which acts as a
detector, decides whether a symbol is emitted by a relevant document.

(b) An IR system equipped with an oracle based on vector probability works as the system of
Figure 1(a) until the symbols reach an oracle which produces other symbols. The symbols produced by the
oracle are vectors. After training the probability distributions, the system B, which acts as a detector,
decides whether a vector is emitted by a relevant document.

Figure 1: IR decision as an emitter and detector problem.



the optimal vectors in the documents is always superior to a system which 
ranks documents according to the classical probability of relevance which 
is based on sets. Section 7 addresses the case of BM25 and how it can be
framed within the theory. An experimental study is illustrated in Section 8
for measuring the degree to which vector probabilistic models outperform
classical probabilistic models if a realistic test collection is used. The
feasibility of the theory is strongly dependent on the existence of an oracle
which tells whether optimal vectors occur in documents; this issue is discussed
in Section 9. After surveying the related work in Section 10, we conclude with
Section 11. The appendix includes the definitions used in the paper and the
proofs of the theoretical results.



2 Intuitive View 

Before entering into mathematics, Figure 1 depicts an intuitive view of what
is illustrated in the rest of the paper. Suppose that non-relevance and
relevance are two events $m_0$, $m_1$ occurring with prior probability $\xi$ and
$1 - \xi$, respectively. Document A is either relevant or non-relevant. Let us view A
as an emitter of binary symbols 0, 1 referring to absence or presence of a given
index term. (We use binary symbols and binary relevance for the sake of clarity.)
On the other side, an IR system B acts as a detector which has to decide
whether a symbol comes from a relevant or a non-relevant document.
B's decision is taken on the basis of some feedback mechanism and
of the relevance and non-relevance probability distributions, which have been
appropriately estimated on the received symbols. An IR system that implements
classical probability decides about relevance without any transformation of
the received symbols (Figure 1(a)), whereas an IR system that implements
vector probability decides about relevance after a transformation of the received
symbols carried out by an oracle, which outputs new symbols that cannot
be straightforwardly derived from the received symbols (Figure 1(b)) but can
be defined as vectors jl]. In this paper, we show that when B is equipped
with such an oracle, it significantly outperforms any other IR system
which implements any classical probabilistic model. We theoretically
measure the improvement in effectiveness on the basis of a mathematical
proposition which holds for every IR system described in Figure 1.



3 Probability of Relevance 



An IR system performing like detector B of Figure 1(a) computes the
probabilities that a symbol (e.g., an index term) occurs in relevant documents
and in non-relevant documents. According to the intuitive view in terms of
emitters and detectors provided in Section 2, these two probabilities
are called, respectively, probability of detection ($P_d$) and probability
of false alarm ($P_0$). These probabilities are also known as expected recall
and fallout, respectively [10]. The system decides whether a document is
retrieved by 

where $\lambda$ is an appropriate threshold, and ranks the retrieved documents by
using the left side of (1). For instance, when independent Bernoulli random
variables are used, we have that

$$P_d \propto \prod_{j=1}^{n} p_j^{x_j} (1 - p_j)^{1 - x_j} \qquad P_0 \propto \prod_{j=1}^{n} q_j^{x_j} (1 - q_j)^{1 - x_j}$$

where $p_j$, $q_j$ are the probabilities that term $j$ occurs in relevant, non-relevant
documents and the $x_j$'s belong to the region of acceptance. Depending on
the available evidence the probabilities are estimated as accurately as possible
and are transformed into weights (e.g., the binary weight or the BM25
illustrated in [12, page 340]).

The Probability Ranking Principle (PRP) defines the optimal document 
subsets in terms of expected recall and fallout. Thus, the optimal document 
subsets are those maximizing effectiveness. The PRP states that, if a cut-off 
is defined for expected fallout, that is, probability of false alarm, we would 
maximize expected recall if we included in the retrieved set those documents 
with the highest probability of relevance [10, page 297], that is, probability of
detection. When a collection is indexed, each document belongs to subsets
labeled by the document index terms and the documents in a subset are
indistinguishable. In fact, (1) optimally ranks subsets whose documents are
represented in the same way (e.g., the documents which are indexed by a
given group of terms or share the same set of feature values). In terms of
decision, if fallout is fixed, the PRP makes it possible to decide whether a document
(subset) should be retrieved with the minimum probability of error.
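The ranking described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decision rule is taken to be the likelihood ratio of the two Bernoulli products compared against a threshold (as the Neyman-Pearson lemma cited in the appendix suggests), and the term probabilities, documents, and threshold below are hypothetical values, not data from the paper.

```python
import numpy as np

def bernoulli_likelihood(x, probs):
    """Likelihood of a binary occurrence vector x under independent Bernoulli terms."""
    x = np.asarray(x, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.prod(probs ** x * (1.0 - probs) ** (1.0 - x)))

def likelihood_ratio(x, p, q):
    """Likelihood under relevance (p) divided by likelihood under non-relevance (q)."""
    return bernoulli_likelihood(x, p) / bernoulli_likelihood(x, q)

# Hypothetical occurrence probabilities for three index terms.
p = [0.8, 0.5, 0.3]  # p_j: term j occurs in relevant documents
q = [0.2, 0.4, 0.6]  # q_j: term j occurs in non-relevant documents

# Hypothetical documents described by binary term-occurrence vectors.
docs = {"d1": [1, 0, 0], "d2": [0, 1, 1], "d3": [1, 1, 0]}

# Rank by the likelihood ratio, then retrieve what exceeds the threshold.
ranking = sorted(docs, key=lambda d: likelihood_ratio(docs[d], p, q), reverse=True)
threshold = 1.0  # hypothetical value of the threshold lambda
retrieved = [d for d in ranking if likelihood_ratio(docs[d], p, q) > threshold]
```

Documents with identical occurrence vectors receive identical scores, which mirrors the remark that documents in the same subset are indistinguishable.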



4 Vector Probability 



When using classical probability, term occurrence corresponds to disjoint
document subsets (i.e., a subset corresponds to an index term occurring
in every document of the subset). When using vector probability, term
occurrence corresponds to a document vector subspace which is spanned by
either of the orthonormal vectors $|0\rangle$ or $|1\rangle$ representing, respectively, absence and
presence of a given term. (For the sake of clarity, we consider a single term,
binary weights and binary relevance as depicted in Figure 1.)

As relevance is an event, two vectors represent binary relevance: a relevance
vector $|m_0\rangle$ represents the non-relevance state and an orthogonal relevance
vector $|m_1\rangle$ represents the relevance state. Relevance vectors and occurrence
vectors belong to a common vector space and thus can be defined in terms of a
given orthonormal basis of that space.

In a vector space, a random variable is a collection of values and of vectors 
(or projectors). The vectors are mutually orthonormal and in one-to-one
correspondence with the values.

Let $x$ be a random variable value (e.g., term occurrence) and $m$ be a
conditioning event (e.g., relevance). In Quantum Theory, $|m\rangle$ is also known as
state vector and is a specialization of a density operator, that is, a Hermitian
operator with unit trace. The vector probability that $x$ is observed given
$m$ is $|\langle x|m\rangle|^2$. When the state is a density operator $\rho$ and an event is represented by
a projector $P$, the vector probability of the event conditioned to the density
operator is given by Born's rule, that is, $\mathrm{tr}(\rho P)$. When $\rho = |m\rangle\langle m|$ and
$P = |x\rangle\langle x|$, vector probability is a specialization of Born's rule. (See [5].)

It is possible to show that 

Proposition 1 A classical probability distribution can be equivalently ex- 
pressed using vector probability. 

The proof is in the appendix. 
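A small numerical check may help to see how a classical distribution is carried by a vector. The sketch below assumes real amplitudes equal to the square roots of the classical probabilities and evaluates Born's rule as a trace; the function names and the example distribution are hypothetical and chosen only for illustration.

```python
import numpy as np

def ket_from_distribution(probs):
    """State vector whose squared amplitudes equal the given classical probabilities."""
    return np.sqrt(np.asarray(probs, dtype=float))

def born_probability(state, event):
    """Vector probability tr(rho P), with rho = |state><state| and P = |event><event|."""
    rho = np.outer(state, state.conj())
    projector = np.outer(event, event.conj())
    return float(np.real(np.trace(rho @ projector)))

# Occurrence vectors: |0> stands for absence and |1> for presence of a term.
ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

classical = [0.7, 0.3]  # hypothetical distribution over absence / presence
m = ket_from_distribution(classical)
vector_probs = [born_probability(m, ket0), born_probability(m, ket1)]
```

Here `vector_probs` coincides with `classical` up to rounding, which is the content of Proposition 1 for a single binary term.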

5 Optimal Vectors 

In this section, we reformulate the PRP by replacing subsets with vector 
subspaces, namely, we replace the notion of optimal document subset with 
that of optimal vectors (or, vector subspaces). Such a reformulation allows 
us to compute the optimal vectors that are more effective than the optimal 



document subsets. To this end, we define a density matrix representing a
probability distribution that has no counterpart in classical probability but
is an extension of it. Such a density matrix is the outer product of a
relevance vector by itself. When classical probability is assumed, a decision
under uncertainty conditions taken upon this density matrix is equivalent
to (1) as illustrated in [TJ. (See the appendix and [7] for the details.)
When vector probability is assumed, a decision under uncertainty conditions
taken upon this density matrix is based upon a different region of acceptance.
Hence, we leverage the following lemma by Helstrom because it gives the rule
to compute the optimal vectors.

Lemma 1 Let $|m_1\rangle, |m_0\rangle$ be the relevance vectors. The optimal vectors
$|\mu_0\rangle, |\mu_1\rangle$ at the highest probability of detection at every probability of false
alarm are given by the eigenvectors of

$$|m_1\rangle\langle m_1| - \lambda\, |m_0\rangle\langle m_0| \tag{2}$$

whose eigenvalues are positive.

Proof See g]. I 
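The eigen-decomposition named in the lemma can be tried numerically. The following is a minimal sketch, assuming real two-dimensional relevance vectors built from hypothetical occurrence probabilities and a threshold equal to 1; it is not the paper's procedure, only an illustration of the operator and of its eigenvectors.

```python
import numpy as np

def relevance_vector(p_occurrence):
    """Real vector whose squared components give absence / presence probabilities."""
    return np.array([np.sqrt(1.0 - p_occurrence), np.sqrt(p_occurrence)])

def optimal_vectors(m0, m1, lam=1.0):
    """Eigen-decomposition of |m1><m1| - lam |m0><m0| (a real symmetric matrix)."""
    operator = np.outer(m1, m1) - lam * np.outer(m0, m0)
    # eigh returns eigenvalues in ascending order and orthonormal eigenvectors.
    return np.linalg.eigh(operator)

m0 = relevance_vector(0.2)  # hypothetical p(1; m0), non-relevance
m1 = relevance_vector(0.6)  # hypothetical p(1; m1), relevance
eigenvalues, eigenvectors = optimal_vectors(m0, m1, lam=1.0)
mu0 = eigenvectors[:, 0]  # negative eigenvalue: decide for non-relevance
mu1 = eigenvectors[:, 1]  # positive eigenvalue: decide for relevance
```

With a threshold of 1 the operator has zero trace, so the two eigenvalues are opposite in sign and the eigenvector with the positive eigenvalue spans the region of acceptance.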



The optimal vectors always exist due to the Spectral Decomposition
theorem [3]; moreover, they are mutually orthogonal because they are eigenvectors
of (2), and they can be defined in the space spanned by the relevance
vectors. The angle between the relevance vectors $|m_1\rangle, |m_0\rangle$ determines the
geometry of the decision of the emitter of Figure 1(b); here, geometry means the
probability distributions of the events. Therefore, the probability of correct
decision and the probability of error are given by the angle between the two
relevance vectors and by the angles between the vectors and the relevance
vectors. Figure 2(a) depicts the geometry of the optimal vectors. (The figure
is in the two-dimensional space for the sake of clarity, but the reader should
generalize to more than two dimensions.) Suppose $|e_1\rangle, |e_0\rangle$ are any
other two vectors. The angles $\eta_0, \eta_1$ between the vectors and the relevance
vectors $|m_0\rangle, |m_1\rangle$ are related to the angle $\gamma$ between $|m_0\rangle, |m_1\rangle$ because the
vectors are always mutually orthogonal and then the angle is $\frac{\pi}{2} = \eta_0 + \gamma + \eta_1$.
The optimal vectors are achieved when the angles between a vector and a
relevance vector are equal to

$$\eta_0 = \eta_1 = \frac{1}{2}\left(\frac{\pi}{2} - \gamma\right) \tag{3}$$

The rotation of the non-optimal vectors such that (2) holds yields the
optimal vectors $|\mu_0\rangle, |\mu_1\rangle$, as Figure 2(b) illustrates: the optimal vectors are
"symmetrically" located around the relevance vectors. If any two vectors
are rotated in an optimal way, we can achieve the most effective document
vector subspaces (or, vectors) in terms of expected recall and fallout. These
vectors cannot be ascribed to the subsets yielded by dint of the PRP, the
latter impossibility being called incompatibility [13, 3].



6 Vector Probability of Relevance 

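In this section, we leverage Lemma 1 to introduce the optimal vectors in IR.
We define $|m_0\rangle$ and $|m_1\rangle$ as:

$$|m_0\rangle = \begin{pmatrix} \sqrt{p(0;m_0)} \\ \sqrt{p(1;m_0)} \end{pmatrix} \qquad |m_1\rangle = \begin{pmatrix} \sqrt{p(0;m_1)} \\ \sqrt{p(1;m_1)} \end{pmatrix} \tag{4}$$

Note that, according to Born's rule, $|\langle m_0|1\rangle|^2 = p(1;m_0)$ and $|\langle m_1|1\rangle|^2 =
p(1;m_1)$; thus (4) reproduces the classical probability distributions.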



If the oracle of Figure 1(b) exists, an IR system performing like detector
B computes the probabilities that the transformation of a binary symbol
referring to an index term into $|\mu_0\rangle$ or $|\mu_1\rangle$ occurs in relevant documents
and in non-relevant documents. The former is called vector probability of
detection ($Q_d$) and the latter is called vector probability of false alarm ($Q_0$).
These probabilities are the analogues of $P_0, P_d$. However, if $|\mu_0\rangle, |\mu_1\rangle$ are the
mutually exclusive symbols yielded by the oracle, we have that

$$Q_0 = |\langle m_0|\mu_1\rangle|^2 \qquad Q_d = |\langle m_1|\mu_1\rangle|^2 \tag{5}$$

The latter expression is not the same probability distribution as (4) because it
refers to different events. According to [1], we have that the vector probability
of error and the vector probability of correct decision are defined, respectively,
as

$$Q_e = \frac{1}{2}\left(1 - \sqrt{1 - 4\xi(1-\xi)|X|^2}\right) \qquad Q_c = 1 - Q_e \tag{6}$$

Both probabilities depend on

$$|X|^2 = \left(\sqrt{p(1;m_0)\,p(1;m_1)} + \sqrt{(1 - p(1;m_0))(1 - p(1;m_1))}\right)^2 \tag{7}$$

which is a measure of the distance between two probability distributions as
proved in [U [H]. As the probability distributions refer to relevance and
non-relevance, $|X|^2$ is a measure of the distance between relevance and
non-relevance. An example may be useful.

Suppose, for example, that the probability distributions are p(l;m ) = 
|, p(l;mi) = 1. If relevance and non-relevance are equiprobable, P e = 
l(P + l-P d ) = l and P c = \ (1 - P + P d ) = §. When A = 1, the optimal 
vectors are the eigenvectors of 

h J 

5 

that is, 

(I/io), |a*i» = ^ ~_\^k I) (8) 

These vectors can be computed in compliance with []]. Hence, Q e = 
(5 — y/E) , which is less than P e . 

The following theorem, which is our main result, shows that the latter
example is not an exception.

Theorem 1 If $p(x; m_j)$, $j = 0, 1$, are two arbitrary probability distributions
conditioned to $m_0, m_1$, the latter indicating the probability distribution of
term occurrence in non-relevant documents and in relevant documents,
respectively, then

$$Q_e < P_e \tag{9}$$

Proof See Section dU I 
Hence, if we were able to find the optimal vectors, retrieval performance of the
detector B of Figure 1(b) would always be higher than retrieval performance
of the detector B of Figure 1(a).
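The comparison between the two probabilities of error can be explored numerically for a single binary term. The sketch below rests on two assumptions that should be read as such: the classical probability of error is taken at the best of the four regions of acceptance, and the vector probability of error is taken in the Helstrom form, one half of one minus the square root of $1 - 4\xi(1-\xi)|X|^2$, with $\xi$ the prior probability of non-relevance. Function names and numbers are hypothetical.

```python
import numpy as np

def classical_error(p0, p1, xi):
    """Smallest classical error: each symbol is assigned to the likelier weighted state."""
    dist0 = np.array([1.0 - p0, p0])  # p(x; m0), non-relevance
    dist1 = np.array([1.0 - p1, p1])  # p(x; m1), relevance
    return float(np.sum(np.minimum(xi * dist0, (1.0 - xi) * dist1)))

def overlap_squared(p0, p1):
    """Squared inner product of the two relevance vectors, written |X|^2 in the text."""
    return float((np.sqrt(p0 * p1) + np.sqrt((1.0 - p0) * (1.0 - p1))) ** 2)

def vector_error(p0, p1, xi):
    """Assumed Helstrom form of the vector probability of error."""
    inside = 1.0 - 4.0 * xi * (1.0 - xi) * overlap_squared(p0, p1)
    return 0.5 * (1.0 - np.sqrt(max(0.0, inside)))  # clip tiny negative rounding

# Hypothetical term: occurs in 20% of non-relevant and 60% of relevant documents.
p_e = classical_error(0.2, 0.6, xi=0.5)
q_e = vector_error(0.2, 0.6, xi=0.5)
```

In this sketch the vector error never exceeds the classical one on a grid of priors and occurrence probabilities, and the two coincide when the distributions are identical.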



7 Getting Beyond the State-of-the-Art 

The development of the theory assumed binary weights for the sake of clarity. 
In the event of non-binary weights, e.g., BM25, we slightly adapt the theory as
follows. If we reverse the derivation of BM25 illustrated in [12],
we can find that the underlying probability distribution is

bit^rm) = B idP {l-mif{l-p{l-mi))-^ (10) 
tj = BM25 saturation factor (11) 

BiJ = < lo S pflim") 1 /-I \ / 1 (12) 

\ \l- P (l; mi ) ) 

n = max{i,} (13) 

thus $b(t_j; m_i)$ is the probability that $t_j$ for term $j$ is observed under state
$m_i$. ($B_{i,j}$ is just a normalization factor.) If $b(t_j; m_i)$ estimates $p(x_j; m_i)$, the
theory still works because Theorem 1 is independent of the estimation of
$p(x_j; m_i)$. Figure 3 depicts an IR system equipped with an oracle based on
vector probability estimated with BM25.



8 Experimental Study 

In this section we report an experimental study. Our study differs from the
common studies conducted in usual IR evaluation because: (1) Theorem 1
already proves that an IR system working as the detector of Figure 1(b) will
always be more effective than any other system; therefore, if the former were
available, every test would confirm the theorem; (2) as an experiment
that compares two systems using, say, Mean Average Precision requires the
implementation of the oracle, which cannot be implemented at present, what
we can measure is only the degree to which an IR system working as the
detector of Figure 1(b) will outperform any other system.

We have tested the theory illustrated in the previous sections through 
experiments based on the TIPSTER test collection, disks 4 and 5. The ex- 
periments aimed at measuring the difference between P e and Q e by means of 
a realistic test collection. To this end, we have used the TREC-6, 7, 8 topic 
sets. The queries are topic titles. We have implemented the following test: 
$p(x; m)$ has been computed for each topic, query word and $m \in \{m_0, m_1\}$ by
means of the usual relative frequency of the word within relevant ($m = m_1$)
or non-relevant ($m = m_0$) documents. In particular, $x = 1$ means presence
and $x = 0$ means absence. Thus, $p(1;m_0)$ is the estimated probability of
occurrence in non-relevant documents and $p(1;m_1)$ is the estimated
probability of occurrence in relevant documents. We have shown in Section 7 that
the improvement is independent of probability estimation and then of term
weighting.
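The estimation step can be illustrated on a toy collection. The sketch below is not the experimental code: the documents, the relevance assessments, and the function names are hypothetical, and the overlap between the two estimated distributions is computed as the squared sum of the square roots of the products of matching probabilities.

```python
import math

def estimate_occurrence(word, documents, relevant_ids):
    """Relative frequency of `word` in non-relevant and in relevant documents."""
    relevant = [d for d in documents if d in relevant_ids]
    non_relevant = [d for d in documents if d not in relevant_ids]
    p1 = sum(word in documents[d] for d in relevant) / len(relevant)
    p0 = sum(word in documents[d] for d in non_relevant) / len(non_relevant)
    return p0, p1

def overlap_squared(p0, p1):
    """Squared overlap between the estimated distributions, written |X|^2 in the text."""
    return (math.sqrt(p0 * p1) + math.sqrt((1 - p0) * (1 - p1))) ** 2

# Hypothetical toy collection: each document is the set of words it contains.
documents = {
    "d1": {"crime", "police"}, "d2": {"crime", "court"}, "d3": {"weather"},
    "d4": {"crime", "sport"}, "d5": {"sport"}, "d6": {"market"},
}
relevant_ids = {"d1", "d2", "d3"}  # hypothetical relevance assessments

p0, p1 = estimate_occurrence("crime", documents, relevant_ids)
x2 = overlap_squared(p0, p1)
```

A value of `x2` close to 1 signals that the word barely separates relevant from non-relevant documents, which is the situation reported for the word crime.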

Consider word crime of Topic No. 301; we have that mo) = 
and p(l;TOi) = J^. Hence, \X\ 2 = 0.998. (Relevance and non-relevance 
probability distributions are very close to each other.) Estimation has taken
advantage of the availability of the relevance assessments and thus it has
been computed on the basis of the explicit assessments made for each topic.
Figure 4 depicts $P_e, Q_e$ as functions of the prior probability $\xi$. $Q_e$ is always
smaller than $P_e$ for every prior probability $\xi$. The vertical distance between
the curves is due to the value of $|X|^2$, which also yields the shape of the $Q_e$
curve, meaning that crime discriminates between relevant and non-relevant
documents to an extent depending on $|X|^2$ and $\xi$. The average curves
computed over all the query words and depicted in Figure 5 give an idea of the
overall discriminative power of the topic. In particular, the total frequencies
within relevant and non-relevant documents are computed for each query
word of a given topic, and the average probability of error is computed for each
prior probability. When $\xi$ is close to $\frac{1}{2}$, the curves are indistinguishable
because $|X|^2$ is very close to 1. The situation radically changes when Topic No.
344 is considered because \X\ 2 Jr; indeed, Figure |6] confirms that Q e 3> P e 
when £ 3> 0. 

We have also investigated the event that explicit relevance assessment 
cannot be used because of the lack of reliable judgements for a suitable 
number of documents. In this event, it is customary to state that p(l; mo) = 



and P e can still be computed as function of p(l;mo) because the latter are 
still valid estimations. In particular, we have that 



Thus, we can analyze the probabilities of error as functions of $\xi$ and
$p(1;m_0)$. Figure 7 depicts how $P_e, Q_e$ change with $\xi$ and $p(1;m_0)$; this plot
does not depend on a topic but rather on $p(1;m_0)$, which is a measure
of the discrimination power of a query word since $\mathrm{IDF} = -\log p(1;m_0)$. The
plot confirms the intuition that $P_e$ increases when $p(1;m_0)$ increases, that
is, when IDF decreases. In particular, $P_e, Q_e$ are close to each other when
little information about the proportion of relevant documents is available
(i.e., $\xi \approx \frac{1}{2}$) and the IDF is not large enough to make a term discriminative.
Nevertheless, if some information about the proportion of relevant documents



Although pseudo- relevance data is assumed, \X 




(14) 



is available (i.e., $\xi$ approaches either 1 or 0), $Q_e$ becomes much smaller than
$P_e$ even when the IDF is small (see the bottom-right side of the plot of
Figure 7). Table 1 reports the average relative topic word frequency for
each topic computed over the query words. The relative frequency gives a
measure of query difficulty and can be used to "access" the plot of Figure 7
to have an idea of $P_e$ when using a classical probabilistic IR model and of
the improvement that can be achieved through an oracle which can produce
the optimal vectors on the basis of the same available evidence as that used
to estimate the $p(x; m)$'s.

9 Regarding the Oracle 

This section explains why the design of the oracle is difficult. To this end,
the section refers to some results of logic in IR reported in some detail in [13].
When binary term occurrence is considered, there are two mutually exclusive
events, i.e., either presence (1) or absence (0). The classical probability
used in IR is based on Neyman-Pearson's lemma, which states that the set
of term occurrences can be partitioned into two disjoint regions: one region
includes all the frequencies such that relevance will be accepted; the
other region denotes rejection [8]. If a term is observed from documents
and only presence/absence is observed, the possible regions of acceptance
are $\emptyset$, $\{0\}$, $\{1\}$, $\{0, 1\}$. When using vectors, the regions of acceptance are
$0$, $|0\rangle\langle 0|$, $|1\rangle\langle 1|$, $I$, which are the projectors onto, respectively, the null subspace,
the subspace spanned by $|0\rangle$, the subspace spanned by $|1\rangle$, and the entire
space.

Consider the symbols $|\mu_0\rangle, |\mu_1\rangle$ emitted by the oracle. Vector probability
is based on Lemma 1, which states that the set of symbols can be partitioned
into two disjoint regions: one region includes all the symbols such that relevance
will be accepted; the other region denotes rejection [3]. If a symbol is
observed from the oracle, the regions of acceptance are $0$, $|\mu_0\rangle\langle\mu_0|$, $|\mu_1\rangle\langle\mu_1|$, $I$.

The problem is that the subspaces spanned by $|\mu_0\rangle, |\mu_1\rangle$ cannot be
defined in terms of set operations on the subspaces spanned by $|0\rangle, |1\rangle$. Vector
subspaces are equivalent to subsets and they then can be subject to set
operations if they are mutually orthogonal [2].

To explain the incompatibility between sets and vectors, we illustrate the
fact that the distributive law cannot be admitted in vector spaces as it is
in set spaces. Figure 8 shows a three-dimensional vector space spanned by
$|e_1\rangle, |e_2\rangle, |e_3\rangle$. The ray (i.e., one-dimensional subspace) $L_x$ is spanned by $|x\rangle$;
the plane (i.e., two-dimensional subspace) $L_{x,y}$ is spanned by $|x\rangle, |y\rangle$. Note
that $L_{e_1,e_2} = L_{x,y} = L_{e_1,y}$ and so on. According to [5, page 191], consider
the vector subspace $L_{e_2} \wedge (L_y \vee L_x)$, where $\wedge$ means "intersection"
and $\vee$ means "span" (and not set union). Since $L_y \vee L_x = L_{x,y} = L_{e_1,e_2}$,
$L_{e_2} \wedge (L_y \vee L_x) = L_{e_2} \wedge L_{e_1,e_2} = L_{e_2}$. However, $(L_{e_2} \wedge L_y) \vee (L_{e_2} \wedge L_x) = 0$
because $L_{e_2} \wedge L_y = 0$ and $L_{e_2} \wedge L_x = 0$; therefore

$$L_{e_2} \wedge (L_y \vee L_x) \neq (L_{e_2} \wedge L_y) \vee (L_{e_2} \wedge L_x) \tag{15}$$

thus meaning that the distributive law does not hold; hence, set operations
cannot be applied to vector subspaces.
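The failure of the distributive law can be reproduced with a few lines of linear algebra. The sketch below represents a subspace by a matrix whose columns span it, and computes span and intersection through singular value decompositions; the oblique vectors chosen for x and y are hypothetical, since the text only requires them to lie in the plane of the first two basis vectors.

```python
import numpy as np

TOL = 1e-10

def span(*subspaces):
    """Orthonormal basis (as columns) of the span of the given subspaces."""
    stacked = np.hstack(subspaces)
    if stacked.shape[1] == 0:
        return stacked
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, s > TOL]

def intersection(a, b):
    """Orthonormal basis (as columns) of the intersection of two subspaces."""
    a, b = span(a), span(b)
    if a.shape[1] == 0 or b.shape[1] == 0:
        return np.zeros((a.shape[0], 0))
    # Common vectors satisfy a @ u = b @ v, i.e. [a, -b] @ [u; v] = 0.
    _, s, vt = np.linalg.svd(np.hstack([a, -b]))
    rank = int(np.sum(s > TOL))
    null_space = vt[rank:].T
    return span(a @ null_space[: a.shape[1]])

L_e2 = np.array([[0.0], [1.0], [0.0]])
L_x = np.array([[1.0], [1.0], [0.0]])  # hypothetical oblique vector in the e1-e2 plane
L_y = np.array([[1.0], [2.0], [0.0]])  # hypothetical oblique vector in the e1-e2 plane

lhs = intersection(L_e2, span(L_y, L_x))                      # L_e2 meet (L_y join L_x)
rhs = span(intersection(L_e2, L_y), intersection(L_e2, L_x))  # join of the two meets
```

The left-hand side is the ray spanned by the second basis vector, while the right-hand side is the null subspace, so the two sides have different dimensions.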

Incompatibility, that is, the invalidity of the distributive law, is due to 
the obliquity of the vectors. Thus, the optimal vectors cannot be defined 
in terms of occurrence vectors due to obliquity and a new "logic" must be
sought.

The preceding example points out the issue of the measurement of the
optimal vectors. Measurement means the actual finding of the presence /
absence of the optimal vectors via an instrument or device. The measurement
of term occurrence is straightforward because term occurrence is a physical
property measured through an instrument or device. (A program that reads
texts and writes frequencies is sufficient.) The measurement of the optimal
vectors is much more difficult because, to our knowledge, no physical property
corresponds to an optimal vector.

Despite the difficulty of measuring optimal vectors, one of the main ad- 
vantages of vector probability and the results reported in this paper is that 
the effort to design the oracle for any other medium than text is comparable 
to the effort to design the oracle for text because the limits to observability 
of the features corresponding to the optimal vectors are actually those un- 
dergone when the informative content of images, video and music must be 
represented. Thus, the question is: what should we observe from a document 
so that the outcome corresponds to the optimal vector? The question is not 
futile because the answer(s) would affect automatic indexing and retrieval.

10 Related Work 

Van Rijsbergen's book [13] is the point of departure of our work. It in- 
troduced a formalism based on the Hilbert spaces for representing the IR 



models within a uniform framework. As the Hilbert spaces have been used 
for formalizing Quantum Theory, the book has also suggested the hypothesis
that quantum phenomena have their analogues in IR. In this paper, we
are not much interested in investigating whether quantum phenomena have
their analogues in IR; in contrast, we use Hilbert vector spaces for describing
probabilistic IR models and for defining more powerful retrieval functions.

The latter use of vector spaces in our paper hinges on Helstrom's book [I], 
which provides the theoretical foundation for the vector probability and the 
optimal vectors. In particular, it deals with optical communication and the 
detectability of optical signals and the improvement of the radio frequency- 
based methods with which their parameters can be estimated. Within this 
domain, Helstrom provides the foundations of Quantum Theory for deciding
among alternative probability distributions (e.g., relevance versus
non-relevance, in this paper). In this paper, we point to a parallel between signal
detection and relevance detection by relating the need to ferret weak
signals out of random background noise to the need to ferret relevance out of
term occurrence. Thus, in this paper, Quantum Theory plays the role of
enlarging the horizon of the possible probability distributions from the classical
mixtures used to define classical distributions to quantum superpositions [7],
although decision under conditions of uncertainty can still be treated by the
theory of statistical decisions developed by, for example, [8] and used in IR
too.

Eldar and Forney's paper [TJ gives an algorithm for computing the optimal
vectors, obtains a new characterization of optimal measurement, and
proves that it is optimal in a least-squares sense. $|X|^2$ is the distance between
densities defined in [H] and is implemented as the squared cosine of the
angle between the subspaces corresponding to the relevance vectors. The
justification of viewing $|X|^2$ as a distance comes from the fact that "the
angle in a Hilbert space is the only measure between subspaces, up to a
constant factor, which is invariant under all unitary transformations, that is,
under all possible time evolutions." [H] The latter is the justification given
in [13] of the use of Born's rule for computing what we call vector probability.

Hughes' book [5] is an excellent introduction to Quantum Theory. In
particular, it addresses incompatibility between observables; we have used
that explanation to illustrate the difficulty in implementing the oracle of
Figure 1(b). However, in [5], there is no mention of optimal vectors. An
introduction to quantum phenomena (i.e., interference, superposition, and
entanglement) and Information Retrieval can be found in [7]. In contrast, we
do not address quantum phenomena because our aim is to leverage vector
space properties in conjunction with probability.

In the IR literature, Quantum Theory is receiving more and more interest.
In [9] the authors propose a quantum formalism for modeling some IR
tasks and information need aspects. In contrast, our paper does not limit the
research to the application of an abstract formalism, but exploits the formalism
to illustrate how the optimal vectors significantly improve effectiveness.
In [15], the authors propose $|X|^2$ for modifying probability of relevance; $|X|^2$
in conjunction with a cosine of the angle of a complex number is intended
to model quantum correlation (also known as interference) between relevance
assessments. The implementation of interference is left to the experimenter
and that paper provides some suggestions. While [15] shows that vector
probability induces a different PRP (called Quantum PRP), this paper shows that
vector probability always induces a more powerful ranking than PRP.

11 Conclusions 

The research in IR has been traditionally concentrated on extracting and 
combining evidence as accurately as possible in the belief that the observed 
features (e.g., term occurrence, word frequency) have to ultimately be scalars 
or structured objects. The quest for reliable, effective, efficient retrieval
algorithms requires implementing the set of features as best one can. The
implementation of a set of features is thus an "answer" to an implicit
"question", that is, which is the best set of features for achieving effectiveness
as high as possible? However, the research in IR often yields incremental
results, thus raising the need to achieve an even better answer. To this end,
we suggest asking another "question": Which is the best vector subspace?
Definitions and concepts

Definition 1 (Probability Distribution) A probability distribution maps
observable values to the real interval [0, 1]. As usual, the probabilities are
non-negative and sum to 1.

Definition 2 (Classical Probability Distribution) A classical probabil- 
ity distribution admits only sets of values. 

The subsets of values can be defined by means of the set operations (i.e., 
intersection, union, complement). Thus, one can compute, for instance, the 
set of relevant documents with a given term frequency. 

Definition 3 (Probability of Detection) It is the probability that a de- 
tector decides for relevance when relevance is true; it is called (expected) recall 
in IR. 

Definition 4 (Probability of False Alarm) It is the probability that a 
detector decides for relevance when relevance is false; it is called (expected) 
fallout in IR. 

Definition 5 (Region of Acceptance) It is the set of the observable val- 
ues that induce the system to decide for relevance. The most powerful region 
of acceptance yields the maximum probability of detection for a fixed proba- 
bility of false alarm. 

For example, a region of acceptance is a set of term frequencies. The Neyman- 
Pearson lemma states that the maximum likelihood ratio test defines the most 
powerful region of acceptance [8]. 
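
The construction of a region of acceptance from a likelihood ratio can be sketched as follows. This is only an illustration under stated assumptions: the function name `most_powerful_region`, the two toy distributions over term frequencies, and the false-alarm budget are hypothetical and do not come from the paper. The sketch is the plain, non-randomized variant of the test, so it may stop slightly below the budget when the distributions are discrete.

```python
from typing import Dict, List, Tuple


def most_powerful_region(
    p_rel: Dict[int, float],
    p_nonrel: Dict[int, float],
    max_false_alarm: float,
) -> Tuple[List[int], float, float]:
    """Greedy likelihood ratio test over discrete observable values.

    Returns the region of acceptance, its probability of false alarm
    and its probability of detection.
    """

    def ratio(x: int) -> float:
        # Likelihood ratio p(x | relevance) / p(x | non-relevance).
        return p_rel[x] / p_nonrel[x] if p_nonrel[x] > 0 else float("inf")

    # Values with the largest likelihood ratio enter the region first.
    ranked = sorted(p_rel, key=ratio, reverse=True)

    region: List[int] = []
    p0 = 0.0  # probability of false alarm (expected fallout)
    pd = 0.0  # probability of detection (expected recall)
    for x in ranked:
        # Stop as soon as the next value would exceed the false-alarm budget.
        if p0 + p_nonrel[x] > max_false_alarm + 1e-12:
            break
        region.append(x)
        p0 += p_nonrel[x]
        pd += p_rel[x]
    return region, p0, pd


# Hypothetical distributions of a term frequency (0, 1 or 2 occurrences).
p_rel = {0: 0.2, 1: 0.3, 2: 0.5}
p_nonrel = {0: 0.7, 1: 0.2, 2: 0.1}
region, p0, pd = most_powerful_region(p_rel, p_nonrel, max_false_alarm=0.3)
```

With these toy numbers the region collects the frequencies whose likelihood ratio is largest, which is the sense in which the region is "most powerful" for the given fallout.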

Definition 6 (Probability of Correct Decision)

$$P_c = \xi (1 - P_0) + (1 - \xi) P_d \tag{16}$$

provided that $\xi$ is the prior probability of non-relevance, $P_0$ is the
probability of false alarm and $P_d$ is the probability of detection.



Definition 7 (Probability of Error)

$$P_e = \xi P_0 + (1 - \xi)(1 - P_d) \tag{17}$$

Of course, $P_e + P_c = 1$. In the following, we adopt the Dirac notation to write
vectors so that the reader may refer to the literature on Quantum Theory; a
brief illustration of the Dirac notation is in [13].
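
The identity between the probability of error and the probability of correct decision can be checked term by term. The short derivation below is only a worked restatement of Definitions 6 and 7, with $\xi$ the prior probability of non-relevance, $P_0$ the probability of false alarm and $P_d$ the probability of detection.

```latex
\begin{align*}
P_e + P_c &= \xi P_0 + (1-\xi)(1-P_d) + \xi(1-P_0) + (1-\xi)P_d \\
          &= \xi\,[P_0 + (1-P_0)] + (1-\xi)\,[(1-P_d) + P_d] \\
          &= \xi + (1-\xi) = 1 .
\end{align*}
```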

Definition 8 (Vector Space) A vector space over a field $\mathcal{F}$ is a set of
vectors subject to linearity, namely, a set such that, for every vector
$|u\rangle$, there are three scalars $a, b, c \in \mathcal{F}$ and three vectors
$|v\rangle, |x\rangle, |y\rangle$ of the same space such that $|u\rangle = a|v\rangle$
and $|u\rangle = b|x\rangle + c|y\rangle$. If $|u\rangle$ is a vector, $\langle u|$ is
its transpose, $\langle v|u\rangle$ is the inner product with $|v\rangle$ and
$|v\rangle\langle u|$ is the outer product with $|v\rangle$. A projector $P$ is a
linear operator acting on a vector space such that $P^n = P$ for every $n > 0$.
In particular, $|u\rangle\langle u|$ is the projector onto the subspace spanned by
$|u\rangle$. If $|\langle x|x\rangle|^2 = 1$, the vector is normal. If
$\langle x|y\rangle = 0$, the vectors are mutually orthogonal. A subspace is a span
of one or more subspaces if its projector is a linear combination of the
projectors of the latter; for example, a ray is a span of a vector, a plane is
a span of two rays (or vectors), and so on.
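
The projector notions of Definition 8 can be illustrated numerically. The sketch below assumes real-valued vectors (so that the transpose is enough) and uses NumPy. The vectors `u` and `v` are hypothetical examples, and the last line uses the standard squared inner product that Quantum Theory associates with a projector and a unit vector.

```python
import numpy as np

# A hypothetical vector, normalised so that <u|u> = 1.
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)

# Outer product |u><u|: the projector onto the ray spanned by |u>.
P = np.outer(u, u)

# A projector is idempotent: applying it twice changes nothing.
idempotent = np.allclose(P @ P, P)

# For a unit vector |v>, <v|P|v> equals the squared inner product |<u|v>|^2.
v = np.array([1.0, 0.0])
prob = float(v @ P @ v)
```

Here `prob` coincides with `(u @ v) ** 2`, so a ray assigns to a unit vector the squared cosine of the angle between the two.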

Definition 9 (Random Variable) In classical probability, a random variable
is a collection of values and of sets. The sets are mutually disjoint and in
1:1 correspondence with the values.

Proof of Proposition 1 Suppose that $p(x; m)$ is the probability that
frequency $x$ is observed given a parameter $m$ corresponding to relevance. Note
that $m$ may refer to more than one parameter. However, we assume that $m$
is scalar for the sake of clarity. In the event of binary relevance, $m$ is either
$m_0$ (non-relevance) or $m_1$ (relevance). The expressions
(18)



establish the relationship between classical probability distributions and
vector probability, namely, between the parameters $m_0, m_1$, the relevance
vectors $|m_0\rangle, |m_1\rangle$ and the observable $X$. The sign of $a(x; m)$ is
chosen so that the orthogonality between the relevance vectors is retained.
Moreover, the orthogonality of the relevance vectors and the following expression
due to orthogonality



(19) 

(20) 
(21) 



establish the relationship between classical and vector probability of rele- 
vance. 



Proof of Theorem 1 Consider Figures 2(a) and 2(b). A probability of
detection $P_d$ and a probability of false alarm $P_0$ define the coordinates of
$|m_0\rangle$ and $|m_1\rangle$ with respect to a given orthonormal basis
$|e_0\rangle, |e_1\rangle$ (that is, an observable):

$$|m_0\rangle = \sqrt{1 - P_0}\,|e_0\rangle + \sqrt{P_0}\,|e_1\rangle \tag{22}$$

$$|m_1\rangle = \sqrt{1 - P_d}\,|e_0\rangle + \sqrt{P_d}\,|e_1\rangle \tag{23}$$

The coordinates are expressed in terms of angles:

$$1 - P_d = \sin^2 \eta_1 \qquad P_0 = \sin^2 \eta_0 \tag{24}$$

provided that $\eta_i$ is the angle between $|m_i\rangle$ and $|e_i\rangle$.
The probability of error is

$$P_e = \xi P_0 + (1 - \xi)(1 - P_d) = \xi \sin^2 \eta_0 + (1 - \xi) \sin^2 \eta_1 \tag{25}$$

The probability of error is minimum when $\eta_0 = \eta_1 = \theta$ as shown in [H page
99].

But $\theta$ is exactly the angle between $|m_i\rangle$ and $|\mu_i\rangle$, $i = 0, 1$,
and is defined as a result of Equation (3). The probability of error is then
minimized when the observable vectors are the $|\mu_i\rangle$, $i = 0, 1$.

Therefore, $Q_e \le P_e$ for all $P_e$, that is, for all the observable vectors. As
$Q_c = 1 - Q_e$, the probability of correct decision is also maximum.
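
The statement of the theorem can be checked numerically in the plane. The sketch below is an illustration, not the paper's experiment: the relevance vectors `m0` and `m1`, the prior `xi` and the grid of rotations are hypothetical choices. The value `q_e` is computed with the standard closed form of the minimum error probability for two unit vectors (the Helstrom bound), which is assumed here to coincide with the paper's $Q_e$. Every orthonormal basis of the plane is obtained by rotating $|e_0\rangle, |e_1\rangle$ by an angle `phi`, and $P_e$ is evaluated as in Definition 7.

```python
import numpy as np


def error_probability(phi, m0, m1, xi):
    """P_e of Definition 7 for the observable rotated by the angle phi."""
    # |e_1> is the basis vector whose outcome means "decide for relevance".
    e1 = np.array([-np.sin(phi), np.cos(phi)])
    p0 = float(e1 @ m0) ** 2  # probability of false alarm
    pd = float(e1 @ m1) ** 2  # probability of detection
    return xi * p0 + (1 - xi) * (1 - pd)


def helstrom_bound(m0, m1, xi):
    """Closed-form minimum error probability for two unit vectors (assumed to be Q_e)."""
    overlap = float(m0 @ m1) ** 2
    return 0.5 * (1 - np.sqrt(1 - 4 * xi * (1 - xi) * overlap))


# Hypothetical relevance vectors and prior probability of non-relevance.
m0 = np.array([np.cos(0.3), np.sin(0.3)])
m1 = np.array([np.cos(1.1), np.sin(1.1)])
xi = 0.5

# Scan all orthonormal bases of the plane and record P_e for each of them.
phis = np.linspace(0.0, np.pi, 20001)
p_e = np.array([error_probability(phi, m0, m1, xi) for phi in phis])
q_e = helstrom_bound(m0, m1, xi)
```

On this grid no basis yields an error probability below `q_e`, and the smallest `p_e` agrees with `q_e` up to the grid resolution.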
Table 1: Average relative frequency per topic.



(a) Non-optimal vectors are mutually orthogonal and asymmetrically placed around
$|m_0\rangle$, $|m_1\rangle$

(b) Optimal vectors are mutually orthogonal and symmetrically placed around
$|m_0\rangle$, $|m_1\rangle$

Figure 2: Geometry of decision and vectors
Figure 3: An IR system equipped with an oracle based on vector probability
estimated with BM25. A document emits the saturation value if the term
occurs, 0 otherwise. The system B trains the probability distributions using
the saturation values. Rather than applying logarithms and computing the
BM25 weights, B invokes the oracle that produces the vectors.




Figure 4: $P_e$ and $Q_e$ plotted against $\xi$ for word crime of topic 301.

Figure 5: $P_e$ and $Q_e$ plotted against $\xi$ for topic 301.

Figure 6: $P_e$ and $Q_e$ plotted against $\xi$ for topic 344.



Figure 8: The difference between subsets and vector subspaces. The span of
$|y\rangle$ and $|x\rangle$ yields the vertical plane, which is spanned by
$|e_1\rangle$ and $|e_2\rangle$ too. If $|e_2\rangle$ is intersected by the plane, the
result is $|e_2\rangle$. But, if $|e_2\rangle$ is intersected by $|y\rangle$ and then by
$|x\rangle$, the span of the two intersections is 0. That is, the
distributive law is not admitted by vectors.
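
The failure of the distributive law described in the caption of Figure 8 can be reproduced with a few lines of linear algebra. The sketch below is an illustration under assumptions: the coordinates chosen for $|e_2\rangle$, $|x\rangle$ and $|y\rangle$ are hypothetical, and the helpers `col`, `basis`, `join` and `meet` are names introduced here. Each subspace is represented by a matrix whose columns span it, the span of two subspaces is computed by `join`, and their intersection by `meet`.

```python
import numpy as np

TOL = 1e-10


def col(*coords):
    """Column vector built from its coordinates."""
    return np.array(coords, dtype=float).reshape(-1, 1)


def basis(a):
    """Orthonormal basis (as columns) of the column span of a; may have no columns."""
    if a.shape[1] == 0:
        return a
    u, s, _ = np.linalg.svd(a, full_matrices=False)
    return u[:, s > TOL]


def join(a, b):
    """Span of two subspaces."""
    return basis(np.hstack([a, b]))


def meet(a, b):
    """Intersection of two subspaces."""
    if a.shape[1] == 0 or b.shape[1] == 0:
        return np.zeros((a.shape[0], 0))
    # a @ c == b @ d exactly when (c, d) lies in the null space of [a, -b].
    _, s, vt = np.linalg.svd(np.hstack([a, -b]))
    rank = int(np.sum(s > TOL))
    null = vt[rank:].T
    return basis(a @ null[: a.shape[1]])


# Hypothetical coordinates: |x> and |y> span the plane that contains |e_2>.
e2 = col(0, 1, 0)
x = col(1, 1, 0)
y = col(1, -1, 0)

lhs = meet(e2, join(x, y))            # |e_2> intersected with the plane
rhs = join(meet(e2, x), meet(e2, y))  # span of the two separate intersections
dim_lhs, dim_rhs = lhs.shape[1], rhs.shape[1]
```

The left-hand side is a ray while the right-hand side is the zero subspace, so the two sides of the distributive law differ, whereas for sets they would always agree.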



